FNP - Operations - Cost Optimization & Karpenter Spot Instances
Summary (Explain Like I’m 5)
Running servers costs money. You pay per hour for compute:
- Premium VM (always on): $2/hour
- Spot VM (spare capacity): $0.40/hour (80% cheaper!)
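The two hourly prices above can be turned into a discount and a monthly bill with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, and the 730 hours per month figure is an assumption (8,760 hours in a year divided by 12), not something stated in this page.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month (8760 / 12)


def savings_percent(on_demand_hourly: float, spot_hourly: float) -> float:
    """Percentage saved by paying the spot price instead of the on-demand price."""
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    return 100.0 * (on_demand_hourly - spot_hourly) / on_demand_hourly


def monthly_cost(hourly_price: float, hours: int = HOURS_PER_MONTH) -> float:
    """Cost of keeping one VM running for the given number of hours."""
    return hourly_price * hours


if __name__ == "__main__":
    # Prices taken from the summary: $2/hour premium, $0.40/hour spot.
    print(f"Spot discount: {savings_percent(2.00, 0.40):.0f}%")
    print(f"Premium VM per month: ${monthly_cost(2.00):,.2f}")
    print(f"Spot VM per month: ${monthly_cost(0.40):,.2f}")
```

The discount depends only on the ratio of the two prices, so it holds for any number of hours or VMs.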
Technical Deep Dive
Cost Structure Analysis:
Mermaid Diagrams
Key Terms
- Spot Instance → AWS spare capacity; 80% cheaper but can be interrupted
- On-Demand Instance → Guaranteed availability; standard pricing
- Consolidation → Karpenter automatically merges pods and deletes idle nodes
- Node Utilization → % of CPU/memory used on node
- Bin-Packing → Optimize pod placement to minimize nodes needed
- TTL (Time-to-Live) → How long Karpenter waits before acting on a node; here, consolidation is evaluated every 1 hour
- Capacity Type → on-demand vs spot
- Cost Optimization → Automated tradeoff between cost and reliability
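To make bin-packing, node utilization, and consolidation concrete, here is a small sketch of first-fit decreasing packing on CPU requests only. It is an illustration of the general idea, not Karpenter's actual scheduling or consolidation logic; the `Node` class, `pack_pods` function, and the sample pod sizes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    cpu_capacity: float  # vCPUs available on the node
    pods: list = field(default_factory=list)  # CPU request of each pod placed here

    @property
    def used(self) -> float:
        return sum(self.pods)

    @property
    def utilization(self) -> float:
        """Share of the node's CPU that is requested, as a percentage."""
        return 100.0 * self.used / self.cpu_capacity


def pack_pods(pod_cpu_requests, node_cpu):
    """First-fit decreasing: place the largest pods first, open a node only when needed."""
    nodes = []
    for request in sorted(pod_cpu_requests, reverse=True):
        if request > node_cpu:
            raise ValueError(f"a pod requesting {request} vCPU cannot fit on a {node_cpu} vCPU node")
        for node in nodes:
            if node.used + request <= node_cpu:
                node.pods.append(request)
                break
        else:
            # No existing node has room, so a new node is needed.
            nodes.append(Node(cpu_capacity=node_cpu, pods=[request]))
    return nodes


if __name__ == "__main__":
    # Hypothetical workload: seven pods on 4 vCPU nodes.
    for i, node in enumerate(pack_pods([2, 1, 3, 1, 2, 1, 2], node_cpu=4), start=1):
        print(f"node {i}: pods={node.pods} utilization={node.utilization:.0f}%")
```

For these sample requests the sketch ends up with three fully requested nodes. A real scheduler also has to account for memory, affinity rules, and disruption budgets, which this sketch ignores.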
Q/A
Q: What happens if a spot instance gets interrupted?
A: Karpenter has interruption handling: (1) AWS sends a 2-minute notice, (2) Karpenter drains the node gracefully (its pods move to another node), (3) pods are rescheduled onto on-demand capacity if needed, (4) the operation succeeds, with a brief latency spike. Typical RTO is ~30 seconds.
Q: Can I lose data if spot instances are terminated?
A: No. Pods are stateless and the database is separate: PostgreSQL runs on managed RDS (guaranteed availability). Pods are replaceable, so if a pod is terminated its work moves to another node with no data loss.
Q: How does Karpenter decide between spot and on-demand?
A: It uses a cost optimization algorithm: (1) predict pod duration (long-running pods are riskier on spot), (2) compare cost against the reliability need, (3) use spot for batch and non-critical work and on-demand for user-facing work. This is configurable per workload.
Q: Is there a “tipping point” where spot becomes too risky?
A: Yes. If interruption rate >5% or latency SLA <99.5%, prefer on-demand. FNP targets 99.99% availability, so a maximum of 60% spot is recommended. Use on-demand for critical user requests.
Q: How are costs tracked and attributed?
A: Karpenter exports metrics: karpenter_nodes_cost_per_hour, karpenter_pod_cost. Grafana dashboards visualize them. AWS billing integration tags resources by workload, so Finance can track cost per feature or customer.
Q: What’s the maximum savings possible?
A: An 80% reduction on compute if 100% of capacity is spot (risky). Realistic: 40-50% with a mix (70% spot, 30% on-demand for critical workloads). Savings compound with: consolidation (-20%), efficient scheduling (-10%), reserved instances (-15% additional).
Example / Analogy
Ride-Share Cost Analogy:
Traditional Deployment (Always Premium):
- Take Uber Black (premium car) every day
- Cost: $1,250/month
- Always available, never wait
Optimized Deployment (Mix of UberX and Uber Black):
- UberX (spot/cheap): 70% of trips = $210
- Uber Black (on-demand): 30% of trips = $375
- Total: $585/month
- Savings: $665/month (53%)
- Trade-off: Sometimes UberX is temporarily unavailable, so you reroute to Uber Black
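The totals in the analogy can be checked with a blended-cost calculation, and the same formula applies to a spot/on-demand mix. This is a rough sketch: the function name is hypothetical, and the all-UberX monthly figure of 300 is not stated in the analogy but inferred from its own numbers (210 for 70% of trips).

```python
def blended_monthly_cost(premium_only: float, cheap_only: float, cheap_share: float) -> float:
    """Monthly cost when `cheap_share` of trips use the cheap option and the rest use premium."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_only + (1.0 - cheap_share) * premium_only


if __name__ == "__main__":
    premium_only = 1250.0  # every trip on Uber Black, from the analogy
    cheap_only = 300.0     # every trip on UberX; inferred as 210 / 0.70, an assumption

    total = blended_monthly_cost(premium_only, cheap_only, cheap_share=0.70)
    savings = premium_only - total
    print(f"Blended total: ${total:,.2f}/month")
    print(f"Savings: ${savings:,.2f}/month ({100 * savings / premium_only:.0f}%)")
```

Changing `cheap_share` shows how the savings shrink as more trips (or workloads) are kept on the premium option for reliability.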
Cross-References: Deployment Architecture, Observability, Scaling Strategy
Category: Operations | Cost Optimization | Infrastructure | DevOps
Difficulty: Intermediate ⭐⭐⭐
Updated: 2025-11-28